Lending Club Loans
Posted on Sun 23 September 2018 in Machine Learning
Lending Club Project¶
Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return.
The data set contains approved loans from 2007 to 2011.
I'll predict whether a borrower will pay off their loan on time or not.
1) Data Cleaning¶
import pandas as pd
# Skip the first line of the raw file
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
# Remove all columns containing more than 50% missing values
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
# Remove useless columns
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)
loans_2007 = pd.read_csv('loans_2007.csv')
print(loans_2007.head())
print(len(loans_2007.columns))
Remove useless columns and columns that leak information from the future¶
loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv",
"grade", "sub_grade", "emp_title", "issue_d", "zip_code",
"out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv",
"total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries",
"collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis = 1)
print(len(loans_2007.columns))
We went from 52 columns down to 32 columns to build the model.
Explore the Different values in the Target column¶
print(loans_2007["loan_status"].value_counts())
Filter the target column and replace its categorical values with numeric labels¶
"Fully Paid" and "Charged Off" are the targets.
"Charged Off" means the borrower can't refund its loan. We can notice there is class imbalance between the two class we want to predict.
print(len(loans_2007))
# Keep only the rows where loan_status is "Fully Paid" or "Charged Off"
loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") | (loans_2007["loan_status"] == "Charged Off")]
print(len(loans_2007))
mapping_dict = {
"loan_status": {
"Fully Paid": 1,
"Charged Off": 0
}
}
loans_2007 = loans_2007.replace(mapping_dict)
loans_2007["loan_status"].head()
loans_2007.head()
Remove single-value columns¶
drop_columns = []
for col in loans_2007.columns:
    # unique() also counts the pandas missing value NaN as a value, so drop NaN values first
    length = len(loans_2007[col].dropna().unique())
    if length == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(len(drop_columns))
loans_2007.to_csv('filtered_loans_2007.csv', index=False)
We removed 8 columns that contained only a single unique value.
2) Preparing Features¶
Calculate number of Null values¶
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)
Handling Missing Values¶
We'll remove columns with more than 1% missing values, then drop the remaining rows that contain null values.
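As a quick check on which column crosses that 1% threshold, the counts above can be turned into fractions — a small sketch reusing null_counts:
# Fraction of missing values per column, largest first
print((null_counts / len(loans)).sort_values(ascending=False).head())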
loans = loans.drop(["pub_rec_bankruptcies"], axis = 1)
loans = loans.dropna(axis = 0)
print(loans.dtypes.value_counts())
# ReIndexing after removing missing values
loans = loans.reset_index(drop=True)
Explore text columns¶
object_columns_df = loans.select_dtypes(include=['object'])
print(object_columns_df.head(1))
--> Some columns seem to be categorical; we need to explore them by looking at their number of unique values.
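A quick way to count the distinct values in each text column — a small sketch using the object_columns_df defined above:
# Number of distinct values per text column, smallest first
print(object_columns_df.nunique().sort_values())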
Explore categorical columns¶
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for column in cols:
    print(loans[column].value_counts())
These 5 columns contain categorical values
Explore the 'purpose' and 'title' columns, which look similar¶
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())
Convert features to Categorical columns¶
The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values. We'll use dummy variables for the columns containing categorical values; this splits a column into separate binary columns. For the int_rate and revol_util columns we need to strip the '%' sign and convert the values to floats.
Between the purpose and title columns we keep purpose because it contains fewer categorical values.
For the emp_length column we do some feature engineering because it contains ordered values.
Columns containing date values (earliest_cr_line, last_credit_pull_d) would require a good amount of feature engineering to be potentially useful, so we remove them.
Finally, addr_state contains too many discrete values.
mapping_dict = {
"emp_length": {
"10+ years": 10,
"9 years": 9,
"8 years": 8,
"7 years": 7,
"6 years": 6,
"5 years": 5,
"4 years": 4,
"3 years": 3,
"2 years": 2,
"1 year": 1,
"< 1 year": 0,
"n/a": 0
}
}
loans = loans.drop(["last_credit_pull_d", "addr_state", "title", "earliest_cr_line","pymnt_plan"], axis = 1)
loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype("float")
loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype("float")
loans = loans.replace(mapping_dict)
loans.head()
Dummy Variables¶
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
loans.head()
3) Making Predictions¶
print(loans.info())
Classification with Logistic Regression & Cross Validation¶
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
lr = LogisticRegression()
cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]
# 3 folds (no shuffling, so the split is deterministic)
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, features, target, cv = kf)
predictions = pd.Series(predictions)
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(loans[tn_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(loans[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(loans[fn_filter])
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(loans[fp_filter])
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print("True Positive Rate (recall): " + str(tpr))
print("False Positive Rate (fall-out): " + str(fpr))
We should optimize for:
- a high recall (true positive rate): we want plenty of loans we could invest in.
- a low fall-out (false positive rate): we don't want to lose money on bad loans; keeping it low minimizes the risk.
Imbalanced Classes: Penalizing the Classifier with Class Weight¶
We can do this by setting the class_weight parameter to "balanced". This tells scikit-learn to penalize the misclassification of the minority class during the training process; the penalty is set to be inversely proportional to the class frequencies.
For the classifier, correctly classifying a row where loan_status is 0 becomes roughly 6 times more important than correctly classifying a row where loan_status is 1.
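As a sanity check on that roughly 6-to-1 ratio, the "balanced" weights can be reproduced by hand with the formula n_samples / (n_classes * count(class)) — a small sketch reusing the target Series defined above:
import numpy as np

# "balanced" weight for each class: n_samples / (n_classes * number of samples in that class)
counts = np.bincount(target)
weights = len(target) / (2 * counts)
print({0: weights[0], 1: weights[1]})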
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression(class_weight = "balanced")
# 3 folds
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, features, target, cv = kf)
predictions = pd.Series(predictions)
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(loans[tn_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(loans[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(loans[fn_filter])
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(loans[fp_filter])
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(tpr)
print(fpr)
We improved the false positive rate.
Manual Penalties: Increase the penalty for misclassifying bad loans¶
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
penalty = {
0: 10,
1: 1
}
lr = LogisticRegression(class_weight=penalty)
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, features, target, cv = kf)
predictions = pd.Series(predictions)
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(loans[tn_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(loans[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(loans[fn_filter])
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(loans[fp_filter])
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(tpr)
print(fpr)
While we have fewer false positives, we are also missing opportunities to make more money.
There is a trade-off between the two rates.
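To make the trade-off concrete, we can sweep the penalty on class 0 and watch both rates move together — a rough sketch reusing the features, target, and kf objects from above (the weight values are arbitrary illustration choices):
for weight in [1, 2, 5, 10, 20]:
    lr = LogisticRegression(class_weight={0: weight, 1: 1})
    preds = pd.Series(cross_val_predict(lr, features, target, cv=kf))
    tp = ((preds == 1) & (target == 1)).sum()
    fn = ((preds == 0) & (target == 1)).sum()
    fp = ((preds == 1) & (target == 0)).sum()
    tn = ((preds == 0) & (target == 0)).sum()
    print(weight, "TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))
Raising the weight on class 0 should push the false positive rate down, at the cost of recall.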
Random Forests¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
kf = KFold(n_splits=3)
predictions = cross_val_predict(rf, features, target, cv = kf)
predictions = pd.Series(predictions)
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(loans[tn_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(loans[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(loans[fn_filter])
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(loans[fp_filter])
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(tpr)
print(fpr)